[57] ABSTRACT

An apparatus and method for the robust recognition of 
speech during a call in a noisy environment is presented. 
Specific background noise models are created to model 
various background noises which may interfere with the
error-free recognition of speech. These background noise models
are then used to determine which noise characteristics a 
particular call has. Once a determination has been made of 
the background noise in any given call, speech recognition 
is carried out using the appropriate background noise model. 

13 Claims, 4 Drawing Sheets 
FIG. 4

SYSTEM INSTRUCTS CALLER TO SAY:

"NS437W"   "BOSTON"   "JULY 1ST"

SYSTEM ANALYZES RESPONSES USING BACKGROUND NOISE MODELS:

MODEL #   RESULTS:

1         MS437V    --           JULY 1ST
2         NS437W    BOSTON       --
3         JS521V    --           --
4         NS437W    BALTIMORE    --
5         PS581W    BALTIMORE    JUNE 15
.         .         .            .
.         .         .            .
.         .         .            .
n         NS437W    BOSTON       JULY 1ST
n+1       NV536W    --           --
SELECTIVE NOISE/CHANNEL/CODING
MODELS AND RECOGNIZERS FOR
AUTOMATIC SPEECH RECOGNITION

FIELD OF THE INVENTION 

The present invention relates to the robust recognition of 
speech in noisy environments using specific noise environ- 
ment models and recognizers, and more particularly, to 
selective noise/channel/coding models and recognizers for 
automatic speech recognition. 

BACKGROUND INFORMATION

Many speech recognition applications in use today have
difficulty properly recognizing speech in a noisy background
environment. Or, if speech recognition applications work well
in one noisy background environment, they may not work well
in another. That is, when a speaker is speaking into a pick-up
microphone/telephone with a background that is filled with
extraneous noise, the speech recognition application may
incorrectly recognize the speech and is thus prone to error.
Thus time and effort are wasted by the speaker and the goals of
the speech recognition applications are often not achieved. In
telephone applications it is often necessary for a human
operator to then have the speaker repeat what has been
previously spoken or attempt to decipher what has been
recorded.

Thus, there has been a need for speech recognition 
applications to be able to correctly assess what has been 
spoken in a noisy background environment. U.S. Pat. No. 
5,148,489, issued Sep. 15, 1992 to Erell et al., relates to the 
preprocessing of noisy speech to minimize the likelihood of
errors. The speech is preprocessed by calculating for each
vector of speech in the presence of noise an estimate of clean
speech. Calculations are accomplished by what is called
minimum-mean-log-spectral distance estimations using
mixture models and Markov models. However, the preprocessing
calculations rely on the basic assumptions that the
clean speech can be modeled because the speech and noise 
are uncorrelated. As this basic assumption may not be true 
in all cases, errors may still occur. 

U.S. Pat. No. 4,933,973, issued Jun. 12, 1990 to Porter, 
relates to the recognition of incoming speech signals in
noise. Pre-stored templates of noise-free speech are modi- 
fied to have the estimated spectral values of noise and the 
same signal-to-noise ratio as the incoming signal. Once 
modified, the templates are compared within a processor by 
a recognition algorithm. Thus recognition is dependent upon
proper modification of the noise-free templates. If modifi- 
cation is incorrectly carried out, errors may still be present 
in the speech recognition. 

U.S. Pat. No. 4,720,802, issued Jan. 19, 1988 to Damoulakis
et al., relates to a noise compensation arrangement.
Speech recognition is carried out by extracting an estimate 
of the background noise during unknown speech input. The 
noise estimate is then used to modify pre-stored noiseless 
speech reference signals for comparison with the unknown 
speech input. The comparison is accomplished by averaging
values and generating sets of probability density signals. 
Correct recognition of the unknown speech thus relies upon 
the proper estimation of the background noise and proper 
selection of the speech reference signals. Improper estima- 
tion and selection may cause errors to occur in the speech 
recognition.

Thus, as can be seen, the industry has not yet provided a 
system of robust speech recognition which can function 
effectively in various noisy backgrounds. 

SUMMARY OF THE INVENTION

In response to the above noted and other deficiencies, the 
present invention provides a method and an apparatus for
robust speech recognition in various noisy environments.
Thus the speech recognition system of the present invention 
is capable of higher performance than currently known 
methods in both noisy and other environments. Additionally, 
the present invention provides noise models, created to 
handle specific background noises, which can quickly be 
determined to relate to the background noise of a specific 
call. 

To achieve the foregoing, and in accordance with the 
purposes of the present invention, as embodied and broadly 
described herein, the present invention is directed to the 
robust recognition of speech in noisy environments using 
specific noise environment models and recognizers. Thus 
models of various noise environments are created to handle 
specific background noises. A real-time system then ana- 
lyzes the background noise of an incoming call, loads the 
appropriate noise model and performs the speech recogni- 
tion task with the model. 
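
As an illustration of the analysis step, the following minimal sketch ranks stored noise models by how closely a recorded noise sample matches each one. The document leaves the method of analysis open, so everything here is an assumption made for illustration: the two features (RMS level and zero-crossing rate), the per-model profiles, and the names `noise_features` and `rank_models` are hypothetical.

```python
import math


def noise_features(samples):
    """Two crude descriptors of a noise sample: RMS level and zero-crossing rate.

    Assumed features for illustration only; needs at least two samples.
    """
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    return (rms, crossings / (len(samples) - 1))


def rank_models(sample, model_profiles):
    """Return model names ordered from the closest stored profile to the farthest.

    model_profiles maps a model name to a hypothetical (rms, zero-crossing rate)
    pair measured when that model was created.
    """
    feats = noise_features(sample)
    return sorted(model_profiles, key=lambda name: math.dist(feats, model_profiles[name]))


if __name__ == "__main__":
    profiles = {
        "city": (0.30, 0.10),
        "airport": (0.60, 0.25),
        "cellular interference": (0.05, 0.45),
    }
    recorded = [0.05, 0.05, -0.05, -0.05] * 100  # quiet, rapidly alternating noise
    print(rank_models(recorded, profiles))
```

The first name in the returned list would be the model to load, and the remaining names give a natural order for standby models. In practice the features would need to be put on comparable scales before distances are taken.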

The background noise models, themselves, are created for 
each set of background noise which may be used. Examples 
of the background noises to be sampled as models would be: 
city noise, motor vehicle noise, truck noise, airport noise, 
subway train noise, cellular interference noise, etc. 
Obviously, the models need not only be limited to simple 
background noise. For instance, various models may model 
different channel conditions, different telephone microphone 
characteristics, various different cellular coding techniques, 
Internet connections, and other noises associated with the 
placement of a call wherein speech recognition is to be used. 
Further, a complete set of sub-word models can be created 
for each characteristic by mixing different background noise 
characteristics. 

Actual creation and collection of the models can be 
accomplished in any known manner, or any manner hereafter
to be known, as long as the noise sampled can be
loaded into a speech recognizer. For instance, models can be 
created by recording background noise and clean speech 
separately and later combining the two. Or, models can be 
created by recording speech with the various background 
noise environments present. Or even further, for example, 
the models can be created using signal processing of 
recorded speech to alter it as if it had been recorded in the 
noisy background. 
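
The first option above, recording noise and clean speech separately and combining them later, can be sketched as follows. This is only one possible way to do the combination: the use of a target signal-to-noise ratio, the function names `power` and `mix_at_snr`, and the synthetic test signals are assumptions for illustration and are not taken from the document.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech, scaled so the mix has the requested SNR in dB.

    The noise recording is repeated or truncated to the length of the speech.
    """
    if not clean or not noise:
        raise ValueError("both signals must be non-empty")
    tiled = [noise[i % len(noise)] for i in range(len(clean))]
    noise_power = power(tiled)
    if noise_power == 0.0:
        raise ValueError("noise recording is silent")
    # From SNR_dB = 10 * log10(P_clean / P_noise), solve for the noise power.
    target_noise_power = power(clean) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [c + gain * n for c, n in zip(clean, tiled)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-ins for real recordings: a 200 Hz tone and uniform random noise.
    clean = [math.sin(2 * math.pi * 200 * t / 8000) for t in range(8000)]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
    noisy = mix_at_snr(clean, noise, snr_db=10.0)
    added = [y - c for y, c in zip(noisy, clean)]
    print(round(10 * math.log10(power(clean) / power(added)), 2))
```

Mixing the same clean speech with each recorded noise type, possibly at several signal-to-noise ratios, would yield one training set per background noise model.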

Which model to use is determined by the
speech recognition apparatus. At the beginning of a call, a
sample of the surrounding background environment from 
where the call is being placed is recorded. As introductory 
prompts, or other such messages are being played to the 
caller, the system analyzes the recorded background noise. 
Different methods of analysis may be used. Once the appro- 
priate noise model has been chosen on the basis of the 
analysis, speech recognition is performed with the model. 
The system can also constantly monitor the speech recog- 
nition function, and if it is determined that speech recogni- 
tion is not at an acceptable level, the system can replace the 
chosen model with another. 
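
The monitoring and replacement behavior described above might be sketched as follows. The sliding window of recent outcomes, the accuracy threshold, and the class name `ModelMonitor` are hypothetical choices made for this sketch; the document says only that the acceptable level is set as deemed appropriate.

```python
from collections import deque


class ModelMonitor:
    """Tracks recent recognition outcomes and swaps in a standby model when needed."""

    def __init__(self, ranked_models, min_accuracy=0.7, window=5):
        if not ranked_models:
            raise ValueError("at least one model is required")
        self.standby = deque(ranked_models)   # best candidate first
        self.current = self.standby.popleft()
        self.min_accuracy = min_accuracy      # assumed acceptance threshold
        self.outcomes = deque(maxlen=window)  # most recent outcomes only

    def report(self, correct):
        """Record one recognition outcome and return the model to use next."""
        self.outcomes.append(bool(correct))
        window_full = len(self.outcomes) == self.outcomes.maxlen
        accuracy = sum(self.outcomes) / len(self.outcomes)
        if window_full and accuracy < self.min_accuracy and self.standby:
            self.current = self.standby.popleft()
            self.outcomes.clear()  # judge the new model on fresh evidence
        return self.current


if __name__ == "__main__":
    monitor = ModelMonitor(["city", "traffic", "airport"], min_accuracy=0.6, window=4)
    for outcome in (True, False, False, False):
        active = monitor.report(outcome)
    print(active)
```

Here each outcome could come, for example, from the caller confirming or rejecting a recognized utterance. Waiting for a full window before switching avoids abandoning a model because of a single early error.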

The present invention and its features and advantages will 
become more apparent from the following detailed descrip- 
tion with reference to the accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 illustrates a speech recognition apparatus for the 
creation, storage and use of various background noise 
models, according to an embodiment of the present inven- 
tion. 

FIG. 2 illustrates a flow chart for determination of the
proper noise model to use, according to an embodiment of 
the present invention. 

FIG. 3 illustrates a flow chart for robust speech recogni- 
tion and, if necessary, model replacement, according to an 
embodiment of the present invention. 
FIG. 4 illustrates a chart of an example of the selection of an appropriate background noise model to be used in the speech recognition application, according to an embodiment of the present invention.

DETAILED DESCRIPTION

FIGS. 1 to 4 show a speech recognition apparatus and method for robust speech recognition in noisy environments according to an embodiment of the present invention. A hidden Markov model is created to model a specific background noise. When a call is placed, background noise is recorded and analyzed to determine which Markov model is most appropriate to use. Speech recognition is then carried out using the appropriately determined model. If speech recognition is not being performed at an acceptable level, the model may be replaced by another.

Referring to FIG. 1, various background noises 1, . . . , n, n+1 are recorded using known sound collection devices, such as pick-up microphones 1, . . . , n, n+1. It is to be understood, of course, that any collection technique, whether known or hereafter to be known, may be used. The various background noises which can be recorded are sounds such as: city noise, traffic noise, airport noise, subway train noise, cellular interference noise, different channel characteristics noise, various different cellular coding techniques noise, Internet connection noise, etc. Of course, the various individual background characteristics may also be mixed in infinite variations. For example, cellular channel characteristics noise may be mixed with background traffic noise. It is to be understood, of course, that other background noises may also be recorded, that what is to be recorded is not to be limited, and that any means sufficient for the recordation and/or storage of sound may be used.

The recorded background noise is then modeled to create hidden Markov models for use in speech recognizers. Modeling is performed in the modeling device 10 using known modeling techniques. In this embodiment, the recorded background noise and pre-labeled speech data are put through algorithms which pick out phonemes creating, in essence, statistical background noise models. As described in this embodiment then, the models are thus created by recording background noise and clean speech separately and later combining the two.

Of course, it is to be recognized that any method capable of creating noise models which can be uploaded into a speech recognizer can be used in the present invention. For instance, models can be created by recording speech with the various background noise environments present. Or, for example, the models can be created using signal processing of the recorded speech to alter it as if it had been recorded in the noisy background.

The modeled background noise is then stored in an appropriate storage device 20. The storage device 20 itself may be located at a central network hub, or it may be reproduced and distributed locally. The various stored background noise models 1, . . . , n, n+1 are then appropriately accessed from the storage device 20 by a speech recognition unit 30 when a call is placed by the telephone user 40. There may, of course, be more than one speech recognition unit 30 used for any given call. Further, the present invention will work equally well with any technique of speech recognition using the background noise models.

Referring to FIG. 2, a call is placed by a user and received by the telephone company in steps 100 and 110, respectively. It is to be recognized, of course, that although the preferred embodiment described herein is in the context of the receipt of a simple telephone call, the present invention will work equally well with any speech transmission technique used and thus is not to be limited to the one embodiment.

Once the connection has been made, in step 120, approximately 2 seconds worth of background noise at the caller's location is recorded and/or monitored. Of course, various amounts of time may be used based on adequate reception and other factors. Introductory messages, instructions or the like are then played in step 125. While messages are being played, the background noise recorded in step 120 is analyzed by the system in step 130. Even while the messages are being played to the caller, the technique of echo cancellation may be used to record and/or monitor further background noise. In explanation, the system will effectively cancel out the messages being played in the recording and/or monitoring of the background noise.

Analysis of the background noise may be accomplished in one or more ways. Signal information, such as the type of signals (DNIS, SS7 signals, etc.), channel port number, or trunk line number may be used to help restrict what the background noise is, and thus what background noise model would be most suitable. For example, the system may determine that a call received over a particular line number may more likely than not be from India, because that line number is the designated trunk for receiving calls from India. Further, the location of the call may be recognized by the caller's account number, time the call is placed or other known information about the caller and/or the call. Such information could be used as a preliminary indicator of the existence and type of background noise.

Alternatively, or in conjunction with the preceding method, a series of questions or instructions to be posed to the caller with corresponding answers to be made by the caller may be used. These answers may then be analyzed using each model (or a pre-determined maximum number of models) to determine which models have a higher correct match percentage. For example, the system may carry on a dialog with the caller and instruct the caller to say "NS437W", "Boston", and "July 1st". The system will then analyze each response using the various background noise models. The model(s) with the correct match for each response by the caller can then be used in the speech recognition application. An illustration of the above analysis method is found in FIG. 4. As can be seen, the analysis of the first response "NS437W" is correctly matched by models 2, 4 and n. However, only models 2 and n correctly matched the second response, and only model n matched all three responses correctly. Thus model n would be chosen for the following speech recognition application.

Also, if the system is unable to definitively decide which model and/or models yield the best performance in the speech recognition application, the system may either guess, use more than one model by using more than one speech recognizer, or compare parameters of the call's recorded background noise to parameters contained in each background noise model.

Once a call from a particular location has been matched to a background noise model, the system can store that information in a database. Thus in step 135, a database of which background noise models are most successful in the proper analysis of the call's background noise can be created and stored. This database can later be accessed when another incoming call is received from the same location. For example, it has previously been determined, and stored in the database, that a call from a particular location should use the city noise background noise model in the speech recognition application, because that model results in the highest percentage of correct speech recognitions. Thus the most appropriate model is used. Of course, the system can dynamically update itself by constantly re-analyzing the call's recorded background noise to detect potential changes in the background noise environment.

Once the call's recorded background noise has been analyzed, or the database has been accessed to determine where the call is coming from and which model is most appropriate, in step 140 the most appropriate background noise model is selected and recalled from the storage means 20. Further, alternative background noise models may be ordered on a standby basis in case speech recognition fails with the selected model. With the most appropriate background noise model having been selected, and other models ordered on standby, the system proceeds in step 150 to the speech recognition application using the selected model.

Referring to FIG. 3, in step 160 the selected background noise model is loaded into the speech recognition unit 30. Here speech recognition is performed using the chosen model. There is more than one method by which the speech recognition can be performed using the background noise model. The speech utterance by the caller can be routed to a preset recognizer with the specific model(s) needed, or the necessary model(s) may be loaded into the speech recognition means 30. In step 180 the correctness of the speech recognition is determined. In this manner then, constant monitoring and adjustment can take place while the call is in progress if necessary.

Correctness of the speech recognition in step 180 may be determined in several ways. If more than one speech recognizer means 30 is being used, the correct recognition of the speech utterance may be determined by using a voter scheme. That is, each speech recognizer unit 30, using a set of models with different background noise characteristics, will analyze the speech utterance. A vote determines what analysis is correct. For example, if fifty recognizers determine that "Boston" has been said by the caller, and twenty recognizers determine that "Baltimore" has been said, then the system determines in step 180 that "Boston" must be the correct speech utterance. Alternatively, or in conjunction with the above method, the system can ask the caller to validate the determined speech utterance. For example, the system can prompt the caller by asking "Is this correct?". A determination of correctness in step 180 can thus be made on a basis of most correct validations by the user and/or lowest rejections (rejections could be set high).

If the minimal criteria of correctness are not met, and thus the most appropriate background noise model loaded in step 160 is determined to be an unsuitable choice, a new model can be loaded. Thus in step 185, the system returns to step 160 to load a new model, perhaps the model which was previously determined in step 140 to be the next in order. The minimal criteria of correctness may be set at any level deemed appropriate and most often will be experimentally determined on the basis of each individual system and its own separate characteristics.

If the determination in step 180 is that speech recognition is proceeding at an acceptable level, then the system can proceed to carry out the caller's desired functions, as shown in step 190.

As such, the present invention has many advantageous uses. For instance, the system is able to provide robust speech recognition in a variety of noisy environments. In other words, the present invention works well over a gamut of different noisy environments and is thus easy to implement. Not only that, but the speech recognition system is capable of a higher performance and a lower error rate than current systems. Even when the error rate begins to approach an unacceptable level, the present system automatically corrects itself by switching to a different model(s).

It is to be understood and expected that variations in the principles of construction and methodology herein disclosed in an embodiment may be made by one skilled in the art and it is intended that such modifications, changes, and substitutions are to be included within the scope of the present invention.

What is claimed is:

1. A method for the robust recognition of speech in a noisy environment, comprising the steps of:
    receiving the speech;
    recording an amount of data related to the noisy environment;
    analyzing the recorded data;
    selecting at least one appropriate background noise model on the basis of the recorded data; and
    performing speech recognition with the at least one selected background noise model.

2. The method according to claim 1, further comprising the step of:
    modeling at least one background noise in a noisy environment to create at least one background noise model.

3. The method according to claim 1, further comprising the step of:
    determining the correctness of the at least one selected background noise model, wherein if the at least one selected model is determined to be incorrect, loading at least one other background noise model for use in the step of performing speech recognition.

4. The method according to claim 1, further comprising the step of:
    constructing a background noise database for use in analyzing the recorded data on the noisy environment.

5. The method according to claim 4, wherein the background noise database is dynamically updated for each location from which data is recorded.

6. The method according to claim 1, wherein the step of analyzing the recorded data is accomplished by using at least one of a plurality of signal information.

7. The method according to claim 1, wherein the step of analyzing the recorded data is accomplished by using a correct match percentage for a plurality of background noise models determined by an input response.

8. The method according to claim 1, wherein the step of performing speech recognition is accomplished by at least one recognizer.

9. A method for improving recognition of speech subjected to noise, the method comprising the steps of:
    sampling a connection noise;
    searching a database for a noise model most closely matching the sampled connection noise; and
    applying the most closely matching noise model to a speech recognition process.

10. The method according to claim 9, wherein the connection noise includes at least one of city noise, motor vehicle noise, truck noise, traffic noise, airport noise, subway train noise, cellular interference noise, channel condition noise, telephone microphone characteristics noise, cellular coding noise, and Internet connection noise.

11. The method according to claim 9, wherein the noise model is constructed by modeling at least one connection noise.

12. The method according to claim 9, wherein when a speech recognition error rate is determined to be above a predetermined level, the system substitutes the applied noise model by applying at least one other noise model.

13. The method according to claim 9, wherein at least one speech recognition unit is used.

* * * * *